Documents as multiple overlapping windows into a grid of counts

Authors

  • Alessandro Perina
  • Nebojsa Jojic
  • Manuele Bicego
  • Andrzej Turski
Abstract

In text analysis, documents are often represented as disorganized bags of words; models of such count features are typically based on mixing a small number of topics [1,2]. Recently, it has been observed that in many text corpora documents evolve into one another smoothly, with some features dropping out and new ones being introduced. The counting grid [3] models this spatial metaphor literally: it is a grid of word distributions learned in such a way that a document's own distribution of features can be modeled as the sum of the histograms found in a window into the grid. The major drawback of this method is that it is essentially a mixture model, so all of a document's content must be generated by a single contiguous area of the grid. This may be problematic, especially for lower-dimensional grids. In this paper, we overcome this issue by introducing the Componential Counting Grid, which brings the componential nature of topic models to the basic counting grid. We evaluated our approach on document classification and multimodal retrieval, obtaining state-of-the-art results on standard benchmarks.
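The window idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy grid, the wrap-around at the grid edges, the normalization of the window sum to an average, and the fixed mixing weights are all assumptions made for illustration. The first function builds the word histogram of a single window (the basic counting grid view); the second mixes several windows, which is one plausible reading of the componential variant.

```python
from typing import List, Sequence, Tuple

# grid[row][col] is a word distribution (a list of probabilities over the vocabulary).
Grid = List[List[List[float]]]


def window_histogram(grid: Grid, top: int, left: int,
                     height: int, width: int) -> List[float]:
    """Average the word distributions of every cell inside one window.

    Assumption: the window wraps around the grid edges (toroidal grid).
    """
    rows, cols = len(grid), len(grid[0])
    vocab = len(grid[0][0])
    hist = [0.0] * vocab
    for dr in range(height):
        for dc in range(width):
            cell = grid[(top + dr) % rows][(left + dc) % cols]
            for z in range(vocab):
                hist[z] += cell[z]
    n_cells = height * width
    return [h / n_cells for h in hist]


def componential_histogram(grid: Grid,
                           windows: Sequence[Tuple[int, int]],
                           weights: Sequence[float],
                           height: int, width: int) -> List[float]:
    """Mix the histograms of several windows (hypothetical componential view).

    `windows` holds (top, left) corners; `weights` should sum to 1.
    """
    vocab = len(grid[0][0])
    mixed = [0.0] * vocab
    for (top, left), weight in zip(windows, weights):
        hist = window_histogram(grid, top, left, height, width)
        for z in range(vocab):
            mixed[z] += weight * hist[z]
    return mixed


# Toy 2x2 grid over a two-word vocabulary (made-up numbers).
toy_grid: Grid = [
    [[1.0, 0.0], [0.5, 0.5]],
    [[0.0, 1.0], [0.5, 0.5]],
]

single = window_histogram(toy_grid, top=0, left=0, height=1, width=2)
mixed = componential_histogram(toy_grid, windows=[(0, 0), (1, 0)],
                               weights=[0.5, 0.5], height=1, width=2)
```

With a single window, the document's words must all come from one contiguous region of the grid. Mixing windows lets one document draw on several regions at once, which is the limitation the abstract says the componential model addresses.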


Similar resources


A probabilistic topic model based on local word relationships in overlapping windows

A probabilistic topic model assumes that documents are generated through a process involving topics and then tries to reverse this process: given the documents, it extracts the topics. A topic is usually assumed to be a distribution over words. LDA is one of the first and most popular topic models introduced so far. In the document generation process assumed by LDA, each document is a distribution o...


Monte Carlo Study of the Effect of Backscatter Material Thickness on 99mTc Source Response in Single Photon Emission Computed Tomography

Introduction: SPECT projections are contaminated by scatter radiation, resulting in reduced image contrast and quantitative errors. Backscatter constitutes a major part of the scatter contamination in lower energy windows. The current study is an evaluation of the effect of backscatter material on FWHM and image quality investigated by Monte Carlo simulation. Materials and Methods: SIMIND program...


Accurate Supervised and Semi-Supervised Machine Reading for Long Documents

We introduce a hierarchical architecture for machine reading capable of extracting precise information from long documents. The model divides the document into small, overlapping windows and encodes all windows in parallel with an RNN. It then attends over these window encodings, reducing them to a single encoding, which is decoded into an answer using a sequence decoder. This hierarchical appr...


Power Management in a Utility Connected Micro-Grid with Multiple Renewable Energy Sources

As an efficient alternative to fossil fuels, renewable energy sources have attracted great attention due to their sustainable, cost-effective, and environmentally friendly characteristics. However, renewable energy sources have low reliability because of their non-deterministic and stochastic generation pattern. The use of hybrid renewable generation systems along with the storag...




Publication date: 2013